
Conversation

@oiffrig
Collaborator

@oiffrig oiffrig commented Jun 5, 2025

This PR is meant to be a discussion around supporting generic bootstrapping. Please consider this a proof of concept to get the discussion going.

Goal

The goal is to add a simple interface for bootstrapping score computations.

Suggested implementation

This PR creates a decorator called enable_bootstrap. When used, it turns a function into a Bootstrappable object whose __call__ method wraps the function. It also adds a bootstrap method that performs bootstrapping instead.
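For reference, here is a minimal sketch of what such a decorator could look like. This is a hypothetical simplification (two numpy inputs, resampling on one axis, fixed number of samples, plain ndarray result), not the PR's actual implementation:

```python
import numpy as np

class Bootstrappable:
    """Hypothetical sketch: wraps a score function and adds a .bootstrap() method."""

    def __init__(self, func, n_samples=100):
        self.func = func
        self.n_samples = n_samples

    def __call__(self, *args, **kwargs):
        # Normal call passes straight through to the wrapped function
        return self.func(*args, **kwargs)

    def bootstrap(self, x, y, axis=-1, rng=None):
        rng = np.random.default_rng(rng)
        n = x.shape[axis]
        samples = []
        for _ in range(self.n_samples):
            idx = rng.integers(0, n, size=n)  # resample indices with replacement
            samples.append(self.func(np.take(x, idx, axis=axis),
                                     np.take(y, idx, axis=axis)))
        return np.stack(samples)  # shape: (n_samples, ...)

def enable_bootstrap(func):
    return Bootstrappable(func)
```

A real implementation would also need to handle keyword arguments, per-input axes, xarray inputs, and a richer result object.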

Example

EDIT: replaced bad difference example with mse
EDIT2: added more examples

Decorator

@enable_bootstrap
def mse(x, y, axis=-1):
    return np.mean(np.square(y - x), axis=axis)

x = ...
y = ...
mse(x, y)  # normal call, returns MSE
bresult = mse.bootstrap(x, y)  # bootstrapping
bresult.quantiles(0.2)
bresult.sig_mask(0.05, 0.0)

Direct call

import numpy as np
from earthkit.meteo.score import pearson

x = ...
y = ...
bootstrapped = bootstrap(pearson, x, y)
np.quantile(bootstrapped, [0.05, 0.95], axis=0)

Direct call with xarray

import xarray as xr

x = xr.DataArray(..., dims=["number", ...])
bootstrapped = bootstrap((lambda arr: arr.mean("number")), x)
bootstrapped.quantile([0.05, 0.95], "sample")

Questions

  • The interface is so far limited to functions that take two (numpy-like) array inputs, with bootstrapping running on one axis of these arrays. Do we need to extend this?
    • Supporting xarray is probably a good idea, then we can bootstrap on named dimensions
    • DONE 780c0f3
  • The BootstrapResult contains the result for every bootstrapping sample. Should we support returning quantiles from bootstrap by default and only return the full set of samples if explicitly requested?
    • Alternative: just return the samples and let the user do their own statistics
  • Is this efficient enough for common use?
  • Do we need other statistics in the BootstrapResult?
    • Standard deviation?
  • Now that the bootstrap function is available, do we want to keep the decorator?

@codecov-commenter

codecov-commenter commented Jun 5, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.53%. Comparing base (80c220f) to head (780c0f3).
Report is 4 commits behind head on develop.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop      #54   +/-   ##
========================================
  Coverage    99.53%   99.53%           
========================================
  Files           14       14           
  Lines          866      866           
  Branches        15       15           
========================================
  Hits           862      862           
  Misses           2        2           
  Partials         2        2           


@martin-janousek-ecmwf
Collaborator

Interesting! Would it work for the following use case?
Compute bootstrapped mean of ROC area

  1. Assume an xarray of contingency tables ct["lat","lon","number","date"], where the contingency table elements are stored under the dimension number (for CTs of a deterministic forecast the size of this dimension is 4, but for an ensemble it is (ens_size+1)*2, see https://sites.ecmwf.int/docs/vtb/annexes.html#annex-2-extended-contingency-table-structure).
  2. Function roc_area(ct,dim="number") returns array roca["lat","lon","date"] (https://github.com/ecmwf/vtb/blob/master/vtb/metricss/xmetrics.py#:~:text=roc_area)
  3. Then to compute a bootstrapped mean of ROC Area along the date dimension one has to
    • create sample of ct bootstrapped along the axis date
    • for each ct sample compute roca
    • compute mean of each roca along date
    • return 5-th and 95-th percentiles of roca means
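The four steps above could be sketched in plain numpy roughly as follows. Here roc_area is a placeholder reduction over the number axis (the real function lives in vtb), and the array sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for roc_area: reduces the "number" axis.
def roc_area(ct, axis=2):
    return ct.mean(axis=axis)  # placeholder reduction, not the real ROC area

ct = rng.random((3, 4, 4, 30))  # (lat, lon, number, date), made-up sizes
n_samples = 200
n_dates = ct.shape[-1]

means = np.empty((n_samples,) + ct.shape[:2])     # (sample, lat, lon)
for i in range(n_samples):
    idx = rng.integers(0, n_dates, size=n_dates)  # 1. resample dates with replacement
    roca = roc_area(ct[..., idx])                 # 2. compute roca per sample
    means[i] = roca.mean(axis=-1)                 # 3. mean over date
low, high = np.percentile(means, [5, 95], axis=0)  # 4. percentiles of the means
```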

@Oisin-M
Contributor

Oisin-M commented Jun 10, 2025

@oiffrig this looks very cool!

On this question:

The interface is so far limited to functions that have two (numpy-like) array inputs and runs bootstrapping on one axis of these arrays. Do we need to extend this?

I think the answer is yes. We have functions where the bootstrapping axis isn't necessarily the same, e.g. score.crps has as input

  • x: array-like (n_ens, n_points)
  • y: array-like (n_points)

It would be a reasonable use case to want to bootstrap this over n_points, but these are on different axes (though of course one can hack it with axis=-1 for now).

We also have some functions that have more than 2 parameters. I give the example of wind.w_from_omega which has as input

  • omega : array-like
  • t : array-like
  • p : array-like

I also had a couple of thoughts/questions I thought I would throw out:

  • I was actually expecting to return directly the result for every bootstrapped sample, letting the user decide what they want to do after - which could be anything from a .mean(axis=0), quantiles, confidence interval, binning the samples etc. Are we sure that quantiles makes sense as a default?
  • Do we need some sort of nan_policy to decide whether or not we also bootstrap missing values? I think it would be good to handle this here as well.
  • I think we should raise a warning if the user tries to set n_samples larger than the input data to sample from. I'm not aware of a case where this makes sense.
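The sample-size warning suggested in the last bullet could look something like this. The helper name and signature are hypothetical, just to make the check concrete:

```python
import warnings
import numpy as np

def resample_indices(n_data, n_samples, sample_size=None, rng=None):
    """Hypothetical helper: draw bootstrap index sets with replacement."""
    rng = np.random.default_rng(rng)
    size = n_data if sample_size is None else sample_size
    if size > n_data:
        # Warn rather than fail: sampling more points than exist is
        # technically valid with replacement, but rarely what you want.
        warnings.warn(
            f"sample_size ({size}) is larger than the data size ({n_data})"
        )
    return rng.integers(0, n_data, size=(n_samples, size))
```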

@corentincarton
Collaborator

corentincarton commented Jun 10, 2025

Thanks @oiffrig, really interesting! I didn't know about bootstrapping so it's always good to learn :)

I'm a bit puzzled about the approach though. From what I see, the bootstrappable function that you use as an example here does not reduce the dataset in the axis direction, i.e.

f(x(n), y(n)) -> z(n)

Couldn't you then bootstrap the result directly instead of bootstrapping the function? That would make the whole process simpler, and would then work for @Oisin-M's and @martin-janousek-ecmwf's cases.

In other words, I feel the bootstrappable function should be the statistical function that reduces the array (here the array is the output of the metric function) over one or multiple dimensions. But those bootstrappable functions should not be the metrics that compare the experiment with the observation (CRPS, difference, bias, etc.).

I always visualise these evaluation workflows in three steps:

  1. Align experiment with observations (interpolation, slice, etc.), x(lat, lon), y(n) -> x(n), y(n)
  2. Compute metric between x and y, x(n), y(n) -> z(n)
  3. Compute statistics over one or multiple dimensions (here 1) z(n) -> w

In this case, the bootstrapping is happening at the third level, and should be a type of spatial/temporal statistics. If I understand correctly of course. Please correct me if I'm wrong!
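The three-step view could be sketched like this, bootstrapping the metric's output at step 3 rather than the metric itself (made-up data and a squared-error metric as an illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 2: pointwise metric z(n) = (y - x)**2, no reduction yet
x = rng.random(100)
y = x + 0.1 * rng.random(100)
z = np.square(y - x)

# Step 3: bootstrap the reduction (here a mean) over the result directly
n_boot = 500
idx = rng.integers(0, z.size, size=(n_boot, z.size))  # resample with replacement
stats = z[idx].mean(axis=1)                           # one mean per bootstrap sample
low, high = np.quantile(stats, [0.05, 0.95])
```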

@martin-janousek-ecmwf
Collaborator

I must admit I was initially confused, as my experience with bootstrapping was exactly along the lines Corentin very clearly described. But then I thought, oh, maybe the bootstrapping was meant as something even more general, like the difference() function from the initial comment, where bootstrapping the difference of two arrays was meant to be done by reshuffling one or both operands and then taking the difference? Which I found interesting, even though I personally was not aware of a use case for it.

@Oisin-M
Contributor

Oisin-M commented Jun 10, 2025

Ah yes, I got confused with the first example. I agree actually the difference function should not be bootstrappable, and neither should the score.crps or wind.w_from_omega examples I gave. The functions that should be bootstrappable are quantiles and sig_mask.

The only edge case I'm not sure about is if a function isn't deterministic, but I don't think we have such a case anyway.

@oiffrig
Collaborator Author

oiffrig commented Jun 16, 2025

Yes, I tried to come up with a quick example, and got confused. Do disregard the difference example. If you have some examples of bootstrapping I'd be happy to see them; after an offline discussion with @corentincarton and @Oisin-M there seems to be quite a range of use cases, and it's definitely worth considering the extent of what we want to support.

@martin-janousek-ecmwf

Interesting! Would it work for the following use case? Compute bootstrapped mean of ROC area

3. Then to compute a bootstrapped mean of ROC Area along the `date` dimension one has to
   
   * create sample of `ct` bootstrapped along the axis `date`
   * for each `ct` sample compute `roca`
   * compute mean of each `roca` along `date`
   * return 5-th and 95-th percentiles of `roca` means

Not as is, but it looks like something we should be able to support. We would need support for xarray (which looks more and more like the reasonable choice to make anyway), and a looser constraint on the number of inputs. Then, the updated bootstrapping would be applied to a function that takes ct and computes the mean of roca:

@enable_bootstrap(dim="date")
def roca_mean(ct, roca_dim="number", mean_dim="date"):
    return roca(ct, dim=roca_dim).mean(dim=mean_dim)

ct = contingency_table()
bresult = roca_mean.bootstrap(ct)
low, high = bresult.quantiles(0.05)

Does that look reasonable to you?

@Oisin-M
Contributor

Oisin-M commented Jun 17, 2025

Correct me if I'm wrong, but I think we don't want to bootstrap the roca_mean function, but rather just the mean function. It's not necessary/desirable to recalculate ROCA for each bootstrapped time sample, since the ROCA doesn't touch the time axis.

From what I understood, the calculation process would be

1. Get CT

ct = read_ct(data)
# dim:size  [lon: Nx, lat: Ny, number: M, date: T]

2. Calculate ROCA

roca = calculate_roca(ct, dim='number')
# dim:size [lat: Nx, lon: Ny, date: T]

Note: This would be bootstrappable along dimension number, but we only want to bootstrap along dimension date. Therefore, we should not be recalculating ROCA for each sampled date; it should be done once.

3. Bootstrap ROCA output

bootstrapped_roca = bootstrap_samples(roca, dim='date', sample_size=S, n_bootstraps=B)
# dim:size [lat: Nx, lon: Ny, date: S, sample: B]

4. Calculate mean per timestep

sample_means = calculate_mean(bootstrapped_roca, dim='date')
# dim:size [lat: Nx, lon: Ny, sample: B]

5. Calculate 5th and 95th quantiles over the samples

quantiles = calculate_quantiles(sample_means, q=[0.05, 0.95], dim='sample')
# dim:size [lat: Nx, lon: Ny, quantile: 2]

Sidenote: we could also bootstrap this along the dimension sample because it aggregates along that dimension. In practice, this would be pretty weird - we would essentially be trying to estimate the uncertainty in our estimated uncertainty of the mean.
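Steps 3-5 above could also be vectorized in plain numpy (assuming numpy-backed data and made-up sizes), resampling all bootstrap index sets at once:

```python
import numpy as np

rng = np.random.default_rng(0)
roca = rng.random((3, 4, 30))    # (lat, lon, date): ROCA computed once, upfront
S, B = 30, 200                   # sample size and number of bootstraps

idx = rng.integers(0, roca.shape[-1], size=(B, S))     # step 3: resample dates
boot = roca[..., idx]                                  # (lat, lon, sample, date)
sample_means = boot.mean(axis=-1)                      # step 4: mean per sample
qs = np.quantile(sample_means, [0.05, 0.95], axis=-1)  # step 5: (quantile, lat, lon)
```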


Assuming I haven't misunderstood things, I think in the snippet therefore it should be rather

@enable_bootstrap(dim="date")
def mean(roca_vals, mean_dim="date"):
    return roca_vals.mean(dim=mean_dim)

ct = contingency_table()
roca_vals = roca(ct, dim="number")
bresult = mean.bootstrap(roca_vals)
low, high = bresult.quantiles(0.05)

if we want to stick with returning this BootstrapResult class. Personally, I would be happier getting back the xarray directly, since the user may want to do some odd calculations afterwards. There is also already a .quantile method for xr DataArrays.

@martin-janousek-ecmwf
Collaborator

But ROCA is computed from a mean (or, equally, a sum) of CTs over time. We want to collect CTs at each station over a period of time and only then compute ROCA (and other CT-based scores like PSS, ETS, etc.).

@martin-janousek-ecmwf
Collaborator

Could it then work like

@enable_bootstrap(dim="date", bdim="sample")
def mean(ct, mean_dim="date"):
    return ct.mean(dim=mean_dim)

ct_vals = contingency_table()
bresult = mean.bootstrap(ct_vals)
roca_vals = roca(bresult, dim="number")
low, high = roca_vals.quantiles(0.05, dim="sample")

@Oisin-M
Contributor

Oisin-M commented Jun 17, 2025

Okay I see, thanks for the clarification @martin-janousek-ecmwf. I think then the snippet @oiffrig sent is the correct approach?

@enable_bootstrap(dim="date")
def roca_mean(ct, roca_dim="number", mean_dim="date"):
    return roca(ct, dim=roca_dim).mean(dim=mean_dim)

ct = contingency_table()
bresult = roca_mean.bootstrap(ct)
low, high = bresult.quantiles(0.05)

@martin-janousek-ecmwf
Collaborator

@enable_bootstrap(dim="date")
def roca_mean(ct, roca_dim="number", mean_dim="date"):
    return roca(ct, dim=roca_dim).mean(dim=mean_dim)

Sorry, but I still can't see this as correct. This hints that we first compute ROCA from the stations' CTs and only then take the mean. The order is the opposite: first we sum up the CTs, and only then do we apply roca(). Sum and roca are not commutative.

Wouldn't it rather be

@enable_bootstrap(dim="date")
def roca_mean(ct, roca_dim="number", mean_dim="date"):
    return roca(ct.mean(dim=mean_dim), dim=roca_dim)

@Oisin-M
Contributor

Oisin-M commented Jun 17, 2025

Thanks for the input @martin-janousek-ecmwf. I agree it's not commutative and the calculation order leads to different interpretations, but I do have some confusion around what we're trying to capture.

Happy to defer to your expertise, but doesn't averaging before computing ROCA lose temporal resolution? I would have thought we would want to compute ROCA per timestep so that we know the full distribution of ROCA values, and then take the mean of that distribution. The other way around, i.e. computing ROCA on the mean, would mean we only consider the "mean ct" and discard other information about the ct distribution, which I guess is why we would want to bootstrap.

But couldn't we just get the distribution of the roca values per timestep directly and find the mean and quantiles of that?

ct = get_ct()
roca_per_timestep = roca(ct) # distribution of roca values

# just directly find metrics about the distribution
mean = mean(roca_per_timestep, dim='time')
low, high = quantiles(roca_per_timestep, dim='time', qs=[0.05, 0.95])

@martin-janousek-ecmwf
Collaborator

Well, I suspect my "expertise" can be easily challenged, as I am not a statistician; here I am just hiding behind the fact that "we have always done it this way". And ROCA was probably a poor choice of example, as both the ROCA of a single step of a single forecast and the monthly mean ROCA of a single step make sense. A different score, like the ETS computed from a "deterministic" 2x2 contingency table, would have made my point better, as ETS is a case where one first has to accumulate CTs over some time period and only then can compute the derived statistic (ETS).

That nonetheless does not change the fact that more often than not we do not use an arithmetic average of "daily" scores to get their average value over a period of time (for the same lead time). We use the arithmetic mean for bias, but monthly RMSE is not an average of daily RMSEs but sqrt(mean(RMSE**2)); for ACC we don't average ACCs but their Fisher-transformed values. And monthly means of ROCA, ETS, PSS etc. are computed from month-accumulated CTs. And the biggest pain: the skill scores.
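To illustrate the RMSE point with made-up numbers: the arithmetic mean of daily RMSEs differs from the RMSE of the period, so the aggregation function matters.

```python
import numpy as np

daily_rmse = np.array([1.0, 2.0, 3.0])   # made-up daily RMSE values
naive_mean = daily_rmse.mean()            # 2.0: NOT the monthly RMSE
monthly_rmse = np.sqrt(np.mean(daily_rmse**2))  # sqrt(14/3), about 2.16
```

This is why a generic bootstrapping interface needs to let the user supply the correct reduction, not just assume an arithmetic mean.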

Honestly, these are kind of corner cases, always implemented only later on, but it may be useful to be aware of them as they can break the system.

@oiffrig
Collaborator Author

oiffrig commented Jun 18, 2025

Thanks both for your input. I'll try to refine the implementation based on this feedback.

Regarding skill scores, I would be keen on having a look at these use cases anyway, just to see how different they are. It would be great if we could support them out of the box!

@oiffrig
Collaborator Author

oiffrig commented Jun 24, 2025

Dear all,

I have now added support for xarray. I have also exposed a few functions, notably resample which returns input arrays for bootstrapping, and bootstrap which does the bootstrapping. I think this should cover a good range of use cases, but please do point me to other examples if you have specific use cases in mind.

Now that we have the bootstrap function, this puts into question the usefulness of the enable_bootstrap decorator. Is there any point in keeping it around? I feel like it would be better to just make the call to bootstrap explicit.

Also, at the moment bootstrap does not return a BootstrapResult. I feel like this is now quite superfluous, especially with xarray supporting a range of statistics out of the box anyway.
